Homework 09: Networks, Variable Visualization, and New 1-D Graph Critiques

General instructions for all assignments:

  • Use this file as the template for your submission. Delete the unnecessary text (e.g. this text, the problem statements, etc). That said, keep the nicely formatted “Problem 1”, “Problem 2”, “a.”, “b.”, etc
  • Upload a single R Markdown file (named as: [AndrewID]-HW09.Rmd – e.g. “sventura-HW09.Rmd”) to the Homework 09 submission section on Blackboard. You do not need to upload the .html file.
  • The instructor and TAs will run your .Rmd file on their computers. If your .Rmd file does not knit on our computers, you will be automatically be deducted 10 points.
  • Your file should contain the code to answer each question in its own code block. Your code should produce plots/output that will be automatically embedded in the output (.html) file
  • Each answer must be supported by written statements (unless otherwise specified)
  • Include the name of anyone you collaborated with at the top of the assignment
  • Include the style guide you used below under Problem 0


Problem 0

Organization, Themes, and HTML Output

(5 points)

a.

For all problems in this assignment, organize your output as follows:

  • Use code folding for all code. See here for how to do this.
  • Use a floating table of contents.
  • Suppress all warning messages in your output by using warning = FALSE and message = FALSE in every code block.
  • Use tabs only if you see it fit to do so – this is your choice.

b.

For all problems in this assignment, adhere to the following guidelines for your ggplot theme and use of color:

  • Do not use the default ggplot() color scheme.
  • For any bar chart or histogram, outline the bars (e.g. with color = "black").
  • Do not use both red and green in the same plot, since a large proportion of the population is red-green colorblind.
  • Try to only use three colors (at most) in your themes. In previous assignments, many students are using different colors for the axes, axis ticks, axis labels, graph titles, grid lines, background, etc. This is unnecessary and only serves to make your graphs more difficult to read. Use a more concise color scheme.
  • Make sure you use a white or gray background (preferably light gray if you use gray).
  • Make sure that titles, labels, and axes are in dark colors, so that they contrast well with the light colored background.
  • Only use color when necessary and when it enhances your graph. For example, if you have a univariate bar chart, there’s no need to color the bars different colors, since this is redundant.
  • In general, try to keep your themes (and written answers) professional. Remember, you should treat these assignments as professional reports.

c.

Treat your submission as a formal report:

  • Use complete sentences when answering questions.
  • Answer in the context of the problem.
  • Treat your submission more as a formal “report”, where you are providing details analyses to answer the research questions asked in the problems.

d.

What style guide are you using for this assignment?

library(tidyverse)
library(data.table)
library(forcats)

#  Simple theme with white background, legend at the bottom
my_theme <-  theme_bw() +
  theme(axis.text = element_text(size = 12, color = "indianred4"),
        text = element_text(size = 14, face = "bold", color = "darkslategrey"))

#  Colorblind-friendly color palette
my_colors <- c("#000000", "#56B4E9", "#E69F00", "#F0E442", "#009E73", "#0072B2", 
               "#D55E00", "#CC7947")


Problem 1

(2 points each)

Parallel Coordinates and Radar Charts

a.

library(MASS)
library(GGally)
data(Cars93)
cont_cols <- which(names(Cars93) %in% 
                     c("Cars93", "Price", "MPG.city", "MPG.highway", "EngineSize",
                       "Horsepower", "RPM", "Fuel.tank.capacity", "Passengers",
                       "Length", "Wheelbase", "Width", "Turn.circle", "Weight"))

ggparcoord(Cars93, columns = cont_cols) + 
  aes(color = factor(Type, labels = levels(Cars93$Type))) +
  labs(x = "Variable in Cars93 Dataset", y = "Standard Deviations from Mean",
       color = "Type of Car") +
  ggtitle("Comparing the Cars93 Continuous Variables") +
  scale_color_manual(labels = c("Compact", "Large", "Midsize", "Small", "Sporty", "Van"),
                     values = my_colors) + 
  my_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

b.

Small and compact cars tend to get better gas mileage compared to the other types. Vans and large cars tend to fit more passengers compare to the other types of cars.

c.

ggparcoord(Cars93, columns = cont_cols) + 
  aes(color = factor(Type, labels = levels(Cars93$Type))) +
  labs(x = "Variable in Cars93 Dataset", y = "Standard Deviations from Mean",
       color = "Type of Car") +
  ggtitle("Comparing the Cars93 Continuous Variables") +
  my_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_polar()

The parallel coordinates chart seems easier to read. The y-axis is always vertical in that chart compared to the radar chart, and the lines are straight as opposed to round in the radar chart.

d.

The default y-axis in this implementation is the number of standard deviations away from the mean. We can change the scale such that the minimum value of each variable is 0 and the maximum is 1.

ggparcoord(Cars93, columns = cont_cols) + 
  aes(color = factor(Type, labels = levels(Cars93$Type))) +
  labs(x = "Variable in Cars93 Dataset", y = "Value (scaled between 0 and 1)",
       color = "Type of Car") +
  ggtitle("Comparing the Cars93 Continuous Variables") +
  my_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

e.

Potential positive correlations:

  • MPG in the city and MPG on the highway

  • Engine Size and Horsepower

  • Length and Wheelbase

  • Width and Turn Circle

Potential negative correlations:

  • Price and MPG in the city

  • MPG on the highway and Engine Size

  • RPM and Fuel Tank Capacity

  • Horsepower and RPM



Problem 2

(18 points)

Correlation Matrices for Examining Variable Relationships

We’ve covered a few ways to visualize datasets that have many continuous variables (e.g. dimension reduction via multi-dimensional scaling, dendrograms, parallel coordinates plots, radar/spider/star charts, etc). Another option here is a correlation heat map.

A correlation heat map provides a quick visualization of the bivariate correlations between all pairs of continuous variables in the dataset.

  1. (0 points) Code is provided for you to do the following tasks:
  • Load the Cars93 dataset
  • Create a subset with only the continuous variables
cars_cont <- dplyr::select(Cars93, Price, MPG.city, MPG.highway, EngineSize, 
                           Horsepower, RPM, Fuel.tank.capacity, Passengers,
                           Length, Wheelbase, Width, Turn.circle, Weight)
  1. (5 points) Partial code is provided for you to do the following tasks:
  • Follow the example here to create a correlation heap map of the continuous variables in the Cars93 dataset
  • Adjust the color scheme so that correlations of -1 is shown in dark red, a correlation of 0 is shown in light grey, and a correlation of 1 is shown in dark blue.
  • Rotate the x-axis tickmark label 90 degrees
  • Add a title and adjust all axis and legend labels appropriately.
library(reshape2)
correlation_matrix <- cor(cars_cont)
melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

  1. (5 points) Interpret the resulting graph:
  • Which pairs of variables are highly positively correlated?
  • Which are highly negatively correlated?
  • Which have approximately no correlation?
  1. (2 points) Explain the connection between this plot and a heat map.

  2. (1 point) Even though they are very different, what plot for examining the associations between categories of two categorical variables does this remind you of?

  3. (5 points) Recreate the graph in (b), making the following additional adjustments to the correlation matrix:

  • Reorder the variables as you see fit
  • Add the rounded correlation to the plot via geom_text()
  • Only display the upper or lower triangle of the graph


Problem 3

(20 points)

Variable Dendrograms

Another way to visually explore potential associations between continuous variables in our dataset is with dendrograms.

  1. (15 points) Create a “variable dendrogram” of the continuous variables in the Cars93 dataset. To do this:
  • Select the continuous variables from the dataset
  • Compute the correlation matrix for these variables
  • Correlations measure similarity and can be negative, while distances measure dissimilarity and cannot be negative. As such, convert your correlations to instead be one minus the absolute value of the correlations, so that correlations near 1 or -1 will have distances of 0, and correlations near 0 will have distances of 1, e.g.: cormat <- 1 - abs(cormat)
  • Convert your transformed correlation matrix to a distance matrix with the as.dist() function.
  • Submit this distance matrix to hierarchical clustering (hclust()), convert the result to a dendrogram (as.dendrogram()), then plot with ggplot().
  • Color the branches by the four-cluster solution. See the link in HW08 for how to do this.
  • Be sure to adjust your axis labels, add a title, etc.
  • The resulting dendrogram should plot highly correlated variables (positively or negatively correlated) in the same branches / clusters in the dendrogram, while uncorrelated variables will be linked at higher “distances” on the dendrogram.
  1. (5 points) Examine the four-cluster solution. Which variables are in the same cluster? Does it make sense that these are in the same cluster, given both your common-sense understanding of these variables and given the correlation plot you created in Problem 2?

  2. (1 point) What other measures (other than correlation) could you use to measure similarity / dissimilarity between continuous variables for the purposes of a variable dendrogram? (There is not necessarily a right or wrong answer here – just brainstorm ideas.)



Problem 4

(2 points each)

Love Actually Character Network

a.

The FiveThirtyEight article used tables to compare the total amount of money that each actor’s movies made after they appeared in “Love Actually,” as well as the average rating of those movies. The article also used network analysis to compare the relative importance of each character to the movie. They looked at the number of scenes each character appeared in, how many scenes each character appeared with the other characters, and what demographic category each character belongs to.

b.

library(dendextend)
love_adjacency <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/love-actually/love_actually_adjacencies.csv")

simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}

actor_names <- chartr("_", " ", colnames(love_adjacency[, -1]))
actor_names <- sapply(actor_names, simpleCap)
colnames(love_adjacency) <- c("actors", actor_names)
love_adjacency$actors <- actor_names

love_dist <- 1 / (1 + as.dist(love_adjacency[, -1]))

love_dist %>% 
  hclust(method = "average") %>%
  as.dendrogram() %>%
  dendextend::set("labels_cex", 0.6) %>%
  ggplot(horiz = T) + 
  my_theme +
  labs(title = "Hierarchy of Love Actually Characters",
       y = "Pairwise Euclidean Distance",
       x = "") +
  scale_x_continuous(breaks = NULL)

c.

The dendrogram shows that the most connected characters are those played by Emma Thompson and Alan Rickman, by Andrew Lincoln and Keira Knightley, and by Abdul Salis and Kris Marshall. Rowan Atkinson’s character seems to be the one that is least connected to the rest of the cast. The clusters seem to be Emma Thompson, Alan Rickman, and Heike Makatsch, Hugh Grant, Bill Nighly, and Liam Neeson, Colin Firth, Keira Nightley, and Andrew Lincoln, and Martin Freeman, Kris Marshall, and Abdul Salis.

d.

The ggraph package makes network visualizations, such as dendrograms and network graphs. It was released on CRAN on February 24, 2017. The function ggraph() is similar to the function ggplot().

e.

We would create a dendrogram using the argument layout = dendrogram in the ggraph() function.

library(igraph)
library(ggraph)
actor_hclust <- hclust(love_dist, method = "average")
dend_labels <- actor_hclust$labels[actor_hclust$order]

love_dist %>% 
  hclust(method = "average") %>%
  as.dendrogram() %>%
  ggraph(layout = "dendrogram") + 
  geom_edge_elbow() +
  geom_node_text(aes(filter = leaf), label = dend_labels,
                 angle = 90, hjust = 0, vjust = -0.7) +
  my_theme +
  labs(title = "Hierarchy of Love Actually Characters",
       y = "Pairwise Euclidean Distance",
       x = "") +
  scale_x_continuous(breaks = NULL)

f.

net_names <- love_adjacency[, 1]
net_graph <- graph_from_adjacency_matrix(as.dist(love_adjacency[,-1]))

ggraph(net_graph, layout = 'kk') + 
  geom_edge_link() +
  geom_node_label(aes(label = net_names),
                  size = 2.2) +
  my_theme +
  labs(title = "Network of Love Actually Characters",
       y = "",
       x = "") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) 

g.

ggraph(net_graph, layout = 'kk') + 
  geom_edge_density() +
  geom_edge_diagonal(edge_width = 1.5) +
  geom_node_label(aes(label = net_names)) +
  my_theme +
  labs(title = "Network of Love Actually Characters",
       y = "",
       x = "") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) 

h.

(BONUS: 3 points)

actor_gender <- c("Male", "Female", 
                     "Male", "Male", 
                     "Male", "Male", 
                     "Female", "Female", 
                     "Female", "Male", 
                     "Male", "Male", 
                     "Male", "Male")

V(net_graph)$names <- love_adjacency$actors
V(net_graph)$gender <- actor_gender

ggraph(net_graph, layout = 'kk') + 
  geom_edge_density() +
  geom_edge_diagonal(edge_width = 1.5) +
  geom_node_label(aes(label = net_names, color = gender),
                  size = 2.2) +
  my_theme +
  labs(title = "Network of Love Actually Characters",
       y = "",
       x = "",
       color = "Gender") +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  scale_colour_manual(values = my_colors[c(2, 3)]) +
  facet_nodes(~ gender)



Problem 5

(4 points each)

Waffle Charts

a.

#  Set up data to create the waffle chart
library(MASS)
data(Cars93)
var <- Cars93$Type  # the categorical variable you want to plot 
nrows <- 9  #  the number of rows in the resulting waffle chart
categ_table <- floor(table(var) / length(var) * (nrows*nrows))

temp <- rep(names(categ_table), categ_table)
df <- expand.grid(y = 1:nrows, x = 1:nrows) %>%
  mutate(category = sort(c(temp, sample(names(categ_table), 
                                        nrows^2 - length(temp), 
                                        prob = categ_table))))
# NOTE: if sum(categ_table) is not nrows^2, it will need adjustment to make the sum = nrows^2.

#  Make the Waffle Chart
ggplot(df, aes(x = x, y = y, fill = category)) + 
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(breaks = NULL) +
  scale_y_continuous(breaks = NULL) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Waffle Chart of Car Type",
       caption = "Source:  Cars93 Dataset", 
       fill = "Car Type",
       x = NULL, y = NULL) + 
  my_theme  #  Use your theme

Waffle charts are used display the proportions of each category for discrete variables. I would use waffle charts to visualize categorical variables when we care about the proportions of observations in each category. Note that it is possible to scale waffle charts so that the total number of boxes in the chart is equal to the sample size (or, at least, approximately so). This is highly recommended for anyone who is planning to use these in the future.

b.

movies <- read_csv("https://raw.githubusercontent.com/sventura/315-code-and-datasets/master/data/imdb_test.csv")

rating_order <- c("Not Rated","G", "PG", "PG-13", "R", "NC-17")
movies <- mutate(movies, content_rating = fct_recode(content_rating,
                                           "Not Rated" = "N/A or Unrated",
                                           "G" = "Validated for All Ages (G)"),
               content_rating = fct_relevel(content_rating, rating_order))

var <- movies$content_rating  # the categorical variable you want to plot 
nrows <- 25 # the number of rows in the resulting waffle chart
categ_table <- floor(table(var) / length(var) * (nrows*nrows))
categ_table <- categ_table[rating_order]

temp <- rep(names(categ_table), categ_table)
df <- expand.grid(y = 1:nrows, x = 1:nrows) %>%
  mutate(category = sort(c(temp, sample(names(categ_table), 
                                        nrows^2 - length(temp), 
                                        prob = categ_table))),
         category = factor(category,levels = rating_order),
         category = category[order(category)])


ggplot(df, aes(x = x, y = y, fill = category)) + 
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(breaks = NULL) +
  scale_y_continuous(breaks = NULL) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Waffle Chart of Imdb Movie Content Rating",
       caption = "Source:  Imdb Dataset", 
       fill = "Content Rating",
       x = NULL, y = NULL) + 
  my_theme

nrows <- 50 # the number of rows in the resulting waffle chart
categ_table <- floor(table(var) / length(var) * (nrows*nrows))
categ_table <- categ_table[rating_order]

temp <- rep(names(categ_table), categ_table)
df <- expand.grid(y = 1:nrows, x = 1:nrows) %>%
  mutate(category = sort(c(temp, sample(names(categ_table), 
                                        nrows^2 - length(temp), 
                                        prob = categ_table))),
         category = factor(category,levels = rating_order),
         category = category[order(category)])

ggplot(df, aes(x = x, y = y, fill = category)) + 
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(breaks = NULL) +
  scale_y_continuous(breaks = NULL) +
  scale_fill_brewer(palette = "Reds") +
  labs(title = "Waffle Chart of Imdb Movie Content Rating",
       subtitle = "Finer Focus",
       caption = "Source:  Imdb Dataset", 
       fill = "Content Rating",
       x = NULL, y = NULL) + 
  my_theme

I slightly prefer the waffle plot with 50 rows since we can see NC-17 class. Additionally this finer grid more closely reflects true proportions.

c.

Although rough proportions might be obtainable for each class, waffle charts have many weaknesses. Waffle charts (1) fail to provide easily identifiable values / proportions for each group - semi-randomly dispersing the “left-off” blocks, (2) don’t express sample size unless you explicitly design them to do so, (3) potentially eliminate very small classes, and (4) have some expectation of the categorizes to be ordered.

(1-3) cause distortion.

Problem (1) can be corrected with a flipped stacked proportion plot. The other problems can be addressed using a histogram.

In general, waffle charts are very similar to pie charts, so they would inherit most or all of the disadvantages of pie charts.



Problem 6

(1 point each)

Arc Pie Charts

a.

library(ggforce)
Cars93 %>% group_by(Type) %>% 
  summarize(count = n()) %>% 
  mutate(max = max(count),
         focus_var = 0.2 * (count == max(count))) %>%
  ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.8, r = 1, 
                              fill = Type, amount = count), 
                          stat = 'pie') +
  labs(x = NULL, y = NULL, fill = "Type of Car",
       title = "Proportions of Car Types in Line-up") + 
  my_theme + 
  scale_fill_manual(values = my_colors ) + 
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.border = element_blank()) 

b.

r0 is the radius from the center at which the figure is draw, although you can put any numerical value into the code and it will still run (including negative numbers and values above r), a logical constraint would be between 0 and the value of r. When r0 is 0 we have a pie chart.

c.

Cars93 %>% group_by(Type) %>% 
  summarize(count = n()) %>% 
  mutate(max = max(count),
         focus_var = 0.2 * (count == max(count))) %>%
  ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0,  r0 = .8, r = 1, 
                              fill = Type, amount = count, explode = focus_var), 
                          stat = 'pie') +
  labs(x = NULL, y = NULL, fill = "Type of Car",
       title = "Proportions of Car Types in Line-up") + 
  my_theme + scale_fill_manual(values = my_colors) + 
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.border = element_blank())

explode shifts specific classes’ part of the arc pie chart outward by the amount specified (in this case .2). In this case we see that it tries to highlight the proportion to Minvans by removing it from the rest of the arc pie plot.

d.

Cars93 %>% group_by(Type) %>% 
  summarize(count = n()) %>% 
  mutate(max = max(count),
         focus_var = 0.2 * (count == min(count))) %>%
  ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0,  r0 = .8, r = 1, 
                              fill = Type, amount = count, explode = focus_var), 
                          stat = 'pie') +
  labs(x = NULL, y = NULL, fill = "Type of Car",
       title = "Proportions of Car Types in Line-up") + 
  my_theme + scale_fill_manual(values = my_colors) + 
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.border = element_blank())

e.

You could use an arc pie chart to visualize the proportions of a variable with a few categories (say no more than 10).

The issues of arc pie charts are very similar to pie charts, the area of the arc is not directly related to the actual number, as the area is \(\text{proportion}\cdot 2\pi (r^2 - r_0^2)\). As such the arc pie chart creates distortion. Similar to waffle charts, we cannot see the sample size. It is also hard to compare different arcs that are not right next to each other / not properly aligned.

The explode attribute does allow us to highlight a particular class, but without additional commentary it might further increase the difficulty of comparing the classes. This can be explicitly seen when switching back and forth between part (a) and part (d), as part (d) actually makes it harder to see that the proportion of Vans is less than the proportion of Large cars.



Problem 7

(5 points each)

Zoom Zoom

a.

library(ggforce)
library(ggplot2)

movies <- mutate(movies, profit = gross - budget)
ggplot(data = movies, 
       aes(x = jitter(title_year,amount = .3), y = movie_facebook_likes)) + 
  geom_point(alpha = .2) + 
  facet_zoom(x = title_year %in% 2004:2017) +
  scale_color_brewer(palette = "Set3") +
  labs(x = "Year of Movie Release (Jittered)",
       y= "Number of Facebook Likes",
       title = "Facebook Likes Over the Years",
       subtitle = "Focusing on Movies released after Facebook was created (Starting 2004)",
       color = "Content Rating",
       caption = "Imdb Database"
       ) +
  my_theme

b.

We are able to examine the distribution Movies’ Facebook like for movies release after Facebook was created. Focus on these years helps highly the increase of Facebook likes for those movies that could get likes directly when they were released and had the most social and media buzz.



Problem 8

(Free 4 points)

Read Your HW07 and Lab08 Feedback On Blackboard

This is for your own good. Ignore this advice at your own peril.